
    Temporal Sentence Grounding in Streaming Videos

    This paper tackles a novel task, Temporal Sentence Grounding in Streaming Videos (TSGSV), whose goal is to evaluate the relevance between a video stream and a given sentence query. Unlike regular videos, streaming videos are acquired continuously from a particular source and typically must be processed on the fly in applications such as surveillance and live-stream analysis. TSGSV is therefore challenging: the model must infer without access to future frames and must process long frame histories effectively, requirements that earlier methods do not address. To address these challenges, we propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames and reinforces the frames relevant to the query. We conduct extensive experiments on the ActivityNet Captions, TACoS, and MAD datasets. The results demonstrate the superiority of the proposed methods, and a systematic ablation study confirms their effectiveness.
    Comment: Accepted by ACM MM 2023.
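    The abstract describes the language-guided feature compressor only at a high level. As a rough illustration of the general idea, the sketch below scores buffered frame features against the sentence-query embedding and keeps only the most query-relevant ones; the cosine-similarity scoring, the top-k selection, and all names here are illustrative assumptions, not the paper's actual design.

    ```python
    import torch
    import torch.nn.functional as F

    def compress_memory(frame_feats, query_feat, k=32):
        """Keep the k buffered frames most relevant to the query, softly reweighted.

        frame_feats: (T, D) historical frame features; query_feat: (D,) query embedding.
        """
        # Relevance of each buffered frame to the sentence query.
        scores = F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=-1)  # (T,)
        k = min(k, frame_feats.size(0))
        weights, idx = scores.topk(k)  # indices of the most query-relevant frames
        # Reinforce the kept frames with soft relevance weights.
        return frame_feats[idx] * weights.softmax(dim=0).unsqueeze(-1)  # (k, D)

    # Usage: compress 1,000 streamed 512-d frame features into a 32-frame memory.
    mem = compress_memory(torch.randn(1000, 512), torch.randn(512), k=32)
    ```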

    EVE: Efficient zero-shot text-based Video Editing with Depth Map Guidance and Temporal Consistency Constraints

    Motivated by the superior performance of image diffusion models, more and more researchers strive to extend these models to text-based video editing. Nevertheless, current video editing methods mainly suffer from a dilemma between high fine-tuning cost and limited generation capacity. Compared with images, we conjecture that videos necessitate more constraints to preserve temporal consistency during editing. To this end, we propose EVE, a robust and efficient zero-shot video editing method. Under the guidance of depth maps and temporal consistency constraints, EVE derives satisfactory video editing results at an affordable computational and time cost. Moreover, recognizing the absence of a publicly available video editing dataset for fair comparison, we construct a new benchmark, the ZVE-50 dataset. Through comprehensive experimentation, we validate that EVE achieves a satisfactory trade-off between performance and efficiency. We will release our dataset and codebase to facilitate future research.
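    The abstract does not spell out the temporal consistency constraints. A minimal sketch of one such constraint, assuming per-frame diffusion latents and a simple neighboring-frame regularizer (not necessarily what EVE uses), might look like this:

    ```python
    import torch

    def temporal_consistency_loss(latents):
        """latents: (F, C, H, W) per-frame latents of the edited video clip."""
        # Penalize large changes between latents of neighboring frames.
        diff = latents[1:] - latents[:-1]  # (F-1, C, H, W)
        return diff.pow(2).mean()

    # Usage: 8 frames of 4x64x64 diffusion latents.
    loss = temporal_consistency_loss(torch.randn(8, 4, 64, 64))
    ```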

    Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning

    In recent years, the explosion of web videos has made text-video retrieval increasingly essential and popular for video filtering, recommendation, and search. Text-video retrieval aims to rank relevant texts/videos higher than irrelevant ones, and the core of the task is to precisely measure the cross-modal similarity between texts and videos. Recently, contrastive learning methods have shown promising results for text-video retrieval, most of which focus on the construction of positive and negative pairs to learn text and video representations. Nevertheless, they do not pay enough attention to hard negative pairs and lack the ability to model different levels of semantic similarity. To address these two issues, this paper improves contrastive learning with two novel techniques. First, to exploit hard examples for robust discriminative power, we propose a novel Dual-Modal Attention-Enhanced Module (DMAE) that mines hard negative pairs from textual and visual clues. By further introducing a Negative-aware InfoNCE (NegNCE) loss, we can adaptively identify these hard negatives and explicitly highlight their impact in the training loss. Second, our work argues that triplet samples model fine-grained semantic similarity better than pairwise samples. We therefore present a new Triplet Partial Margin Contrastive Learning (TPM-CL) module that constructs partial-order triplet samples by automatically generating fine-grained hard negatives for matched text-video pairs. TPM-CL designs an adaptive token masking strategy with cross-modal interaction to model subtle semantic differences. Extensive experiments demonstrate that the proposed approach outperforms existing methods on four widely used text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet.
    Comment: Accepted by ACM MM 2023.
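    To make the NegNCE idea concrete, here is a hedged sketch: a standard InfoNCE term plus an extra penalty on "hard" negatives whose similarity comes within a margin of the positive pair. The margin-based mining rule and the weighting are assumptions for illustration; the paper defines the actual loss.

    ```python
    import torch
    import torch.nn.functional as F

    def neg_aware_infonce(sim, tau=0.05, margin=0.1, alpha=0.5):
        """sim: (B, B) text-video similarity matrix; the diagonal holds positives."""
        labels = torch.arange(sim.size(0), device=sim.device)
        base = F.cross_entropy(sim / tau, labels)  # standard InfoNCE term

        pos = sim.diagonal().unsqueeze(1)  # (B, 1) positive similarities
        off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        # Hard negatives: off-diagonal pairs scoring within `margin` of the positive.
        hard = (sim > pos - margin) & off_diag
        penalty = F.softplus(sim[hard] / tau).mean() if hard.any() else sim.new_zeros(())
        return base + alpha * penalty

    # Usage: a batch of 16 text-video pairs.
    loss = neg_aware_infonce(torch.randn(16, 16))
    ```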

    Automatic Car Damage Assessment System: Reading and Understanding Videos as Professional Insurance Inspectors

    We demonstrate a car damage assessment system for the car insurance field based on artificial intelligence techniques, which can exempt insurance inspectors from checking cars on site and help people without professional knowledge evaluate car damage when accidents happen. Unlike existing approaches, we use videos instead of photos to interact with users, making the whole procedure as simple as possible. We adopt object and video detection and segmentation techniques from computer vision and take advantage of multiple frames extracted from videos to achieve high damage recognition accuracy. The system uploads video streams captured by mobile devices, recognizes car damage on the cloud asynchronously, and then returns the damaged components and repair costs to users. The system evaluates car damage and returns results automatically and effectively in seconds, which reduces labor costs and significantly decreases insurance claim time.
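    The multi-frame aggregation step lends itself to a short sketch: run any per-frame damage recognizer over the extracted frames and average its confidences, so a damaged component need only be seen clearly in a few frames. The recognizer interface and the threshold here are hypothetical, not the system's actual design.

    ```python
    from collections import defaultdict
    from typing import Callable, Dict, Iterable

    def assess_video(frames: Iterable,
                     recognize_frame: Callable[[object], Dict[str, float]],
                     threshold: float = 0.6) -> Dict[str, float]:
        """Average per-frame damage confidences; report components above threshold."""
        totals, counts = defaultdict(float), defaultdict(int)
        for frame in frames:
            for component, conf in recognize_frame(frame).items():
                totals[component] += conf
                counts[component] += 1
        return {c: totals[c] / counts[c]
                for c in totals if totals[c] / counts[c] >= threshold}

    # Usage with a dummy recognizer standing in for the real detection model.
    report = assess_video(range(5), lambda f: {"front_bumper": 0.8, "hood": 0.3})
    # -> {"front_bumper": 0.8}
    ```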

    Sperm cells are passive cargo of the pollen tube in plant fertilization

    Sperm cells of seed plants have lost their motility and are transported by the vegetative pollen tube cell for fertilization, but the extent to which they regulate their own transport has been a long-standing debate. Here we show that Arabidopsis plants lacking two bHLH transcription factors produce pollen without sperm cells. This abnormal pollen mostly behaves like the wild type, demonstrating that sperm cells are dispensable for normal pollen tube development.

    Maternal ENODLs Are Required for Pollen Tube Reception in Arabidopsis

    During the angiosperm (flowering-plant) life cycle, double fertilization represents the hallmark between the diploid and haploid generations [1]. The success of double fertilization largely depends on compatible communication between the male gametophyte (pollen tube) and the maternal tissues of the flower, culminating in precise pollen tube guidance to the female gametophyte (embryo sac) and its rupture to release the sperm cells. Several important factors involved in pollen tube reception have been identified recently [2-6], but the underlying signaling pathways are far from understood. Here, we report that a group of female-specific small proteins, early nodulin-like proteins (ENODLs, or ENs), are required for pollen tube reception. ENs feature a plastocyanin-like (PCNL) domain, an arabinogalactan (AG) glycomodule, and a predicted glycosylphosphatidylinositol (GPI) anchor motif. We show that ENs are asymmetrically distributed at the plasma membrane of the synergid cells and accumulate at the filiform apparatus, where arriving pollen tubes communicate with the embryo sac. EN14 strongly and specifically interacts with the extracellular domain of the receptor-like kinase FERONIA, which is localized at the synergid cell surface and known to critically control pollen tube reception [6]. Wild-type pollen tubes failed to arrest growth and to rupture after entering the ovules of quintuple loss-of-function EN mutants, indicating a central role of ENs in male-female communication and pollen tube reception. Moreover, overexpression of EN15 under its endogenous promoter disturbed pollen tube guidance and reduced fertility. These data suggest that female-derived GPI-anchored ENODLs play an essential role in male-female communication and fertilization.